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ABSTRACT 

How can we correlate neural activity in the human brain as it re- 
sponds to words, with behavioral data expressed as answers to ques- 
tions about these same words? In short, we want to find latent vari- 
ables, that explain both the brain activity, as well as the behavioral 
responses. We show that this is an instance of the Coupled Matrix- 
Tensor Factorization (CMTF) problem. We propose Scoup-Smt, 
a novel, fast, and parallel algorithm that solves the CMTF problem 
and produces a sparse latent low-rank subspace of the data. In our 
experiments, we find that Scoup-Smt is 50-100 times faster than 
a state-of-the-art algorithm for CMTF, along with a 5 fold increase 
in sparsity. Moreover, we extend Scoup-Smt to handle missing 
data without degradation of performance. 

We apply Scoup-Smt to BrainQ, a dataset consisting of a 
(nouns, brain voxels, human subjects) tensor and a (nouns, prop- 
erties) matrix, with coupling along the nouns dimension. SCOUP- 
Smt is able to find meaningful latent variables, as well as to pre- 
dict brain activity with competitive accuracy. Finally, we demon- 
strate the generality of Scoup-Smt, by applying it on a FACE- 
BOOK dataset (users, 'friends', wall-postings); there, Scoup-Smt 
spots spammer-like anomalies. 
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1. INTRODUCTION 

How is knowledge mapped and stored in the human brain? How 
is it expressed by people answering simple questions about specific 
words? If we have data from both worlds, are we able to com- 
bine them and jointly analyze them? In a very different scenario, 
suppose we have the social network graph of an online social net- 
work, and we also have additional information about how and when 
users interacted with each other. What is a comprehensive way to 
combine those two pieces of data? Both, seemingly different, prob- 



lems may be viewed as instances of what is called Coupled Matrix- 
Tensor Factorization (CMTF), where a data tensor and matrices 
that hold additional information are jointly decomposed into a set 
of low-rank factors. 

In this work, we introduce Scoup-Smt, a fast, scalable, and 
sparsity promoting CMTF algorithm. Our main contributions are 
the following: 

• Fast, parallel & sparsity promoting algorithm: We provide 
a novel, scalable, and sparsity inducing algorithm, SCOUP- 
Smt, that jointly decomposes coupled matrix-tensor data. 
Figure [T] shows the accuracy of Scoup-Smt (compared to 
the traditional algorithm), as a function of portion of the wall- 
clock time that our algorithm took, again compared to the tra- 
ditional one. The result indicates a speedup of about 50-100 
times, while maintaining very good accuracy^ 

• Robustness to missing data: We carefully derive an improved 
version of the above algorithm which is resilient to missing 
data and performs well, even with a large portion of the en- 
tries missing. 

• Effectiveness & Knowledge Discovery: We analayze BrainQ, 
a brain scan dataset which is coupled to a semantic matrix 
(see Sec. [4] for details). The brain scan part of the dataset 
consists of fMRI scans first used in 1161 . a work that first 
demonstrated that brain activity can be predictably analyzed 
into component semantic features. Here, we demonstrate a 
disciplined way to combine both datasets and carry out a 
variety of data mining/machine learning tasks, through this 
joint analysis. 

• Generality: We illustrate the generality of our approach, by 
applying Scoup-Smt to a completely different setting of a 
time-evolving social network with side information on user 
interactions, demonstrating Scoup-Smt's ability to discover 
anomalies. 

2. PRELIMINARIES 

2.1 Introduction to Tensors 

Matrices record dyadic properties, like "people recommending 
products". Tensors are the n-mode generalizations, capturing 3- 
and higher-way relationships. For example "subject-verb-object" 
relationships, such as the ones recorded by the Read the Web - 
NELL project 1 1 1 (and have been recently used in this context 1101 

'Accuracy or relative cost is defined in Section|5]as the ratio of the 
squared approximation error of S COUP-Smt, divided by that of the 
traditional ALS algorithm. 
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Figure 1: The relative cost of Scoup-Smt (with respect to the 
ALS algorithm) as a function of the fraction of the wall-clock time 
of ALS that the computation required, vividly demonstrates the 
gains of Scoup-Smt in terms of speedup. In particular, for the en- 
tire BrainQ dataset which is very dense (see Sec. |4), the speedup 
incurred by the parallel version of Scoup-Smt on 4 cores, was in 
the range of 50-100 times. This Figure also shows the behavior of 
Scoup-Smt with respect to the sampling parameter s. As s in- 
creases, Scoup-Smt runs faster but the relative cost increases as 
well. 
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CMTF 
ALS 

X, X, X, X 
A0B 
A * B 
At 

l|A|k 

a o b o c 

(i) as superscript 
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I 

x(2:) 



Description 

Coupled Matrix-Tensor Factorization 

Alternating Least Squares 

scalar, vector, matrix, tensor (respectively) 

Khatri-rao product (see 1 12|). 

Hadamard (elementwise) product. 

Pseudoinverse of A (see Sec. [2j 

Frobenius norm of A. 

(a o b o c) k) = a(i)b(j)c(fc) 

Indicates the i-th iteration 

series of matrices or vectors, indexed by i. 

i-th mode unfolding of tensor X (see UTI ). 

Set of indices. 

Spanning indices X of x. 



Table 1: Table of symbols 

II9I ) naturally lead to a 3-mode tensor. In this work, our working 
example of a tensor has three modes. The first mode contains a 
number of nouns; the second mode corresponds to the brain activ- 
ity, as recorded by an fMRI machine; and the third mode identifies 
the human subject corresponding to a particular brain activity mea- 
surement. 

Earlier f 191 we introduced a scalable and parallelizable tensor 
decomposition which uses mode sampling. In this work, we fo- 
cus on a more general and expressive framework, that of Coupled 
Matrix-Tensor Factorizations. 

2.2 Coupled Matrix-Tensor Factorization 

Oftentimes, two tensors, or a matrix and a tensor, may have one 
mode in common; consider the example that we mentioned earlier, 
where we have a word by brain activity by human subject tensor, 
we also have a semantic matrix that provides additional information 
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Figure 2: PARAFAC decomposition of a three-way ten.sor of a brain ac- 
tivity tensor as sum of F outer products (rank-one tensors), reminiscing of 
the rank-F singular value decomposition of a matrix. Each component cor- 
responds to a latent concept of, e.g. "insects", "tools" and so on, a set of 
brain regions that are most active for that particular set of words, as well as 
groups of persons. 

for the same set of words. In this case, we say that the matrix and 
the tensor are coupled in the 'subjects' mode, 
persons 
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Figure 3: Coupled Matrix - Tensor example: Tensors often share one or 
more modes (with thick, wavy line): X is the brain activity tensor and Y 
is the semantic matrix. As the wavy hne indicates, these two datasets are 
coupled in the 'word' dimension. 

In this work we focus on three mode tensors, however, every- 
thing we mention extends directly to higher modes. In the general 
case, a three mode tensor X may be coupled with at most three ma- 
trices Yi, i = 1 ■ ■ ■ 3, in the manner illustrated in Figure |3]for one 
mode. The optimization function that encodes this decomposition 
is: 
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(1) 



||Yi - AD'^lll 



||Y2-BE^||^. + ||Y3-CG^||^. 

where a^ is the fc-th column of A. The idea behind the coupled 
matrix-tensor decomposition is that we seek to jointly analyze X 
and Yi, decomposing them to latent factors who are coupled in 
the shared dimension. For instance, the first mode of X shares the 
same low rank column subspace as Yi; this is expressed through 
the latent factor matrix A which jointly provides a basis for that 
subspace. 

2.3 The Alternating Least Squares Algorithm 

One of the most popular algorithms to solve PARAFAC (as in- 
troduced in Figure |2) is the so-called Alternating Least Squares 
(ALS); the basic idea is that by fixing two of the three factor matri- 
ces, we have a least squares problem for the third, and we thus do 
so iteratively, alternating between the matrices we fix and the one 
we optimize for, until the algorithm converges, usually when the 
relative change in the objective function between two iterations is 
very small. 

Solving CMTF using ALS follows the same strategy, only now, 
we have up to three additional matrices in our objective. For in- 
stance, when fixing all matrices but A, the update for A requires 
to solve the following least squares problem: 



min l|X(i) 



(B0C)A' 
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Yi - DA^ 



In Algorithm [T] we provide a detailed outline of the ALS algo- 
rithm for CMTF. 

In order to obtain initial estimates for matrices A, B, C we take 
the PARAFAC decomposition of X- As for matrix D (and simi- 
larly for the rest), it suffices to solve a simple Least Squares prob- 
lem, given the PARAFAC estimate of A, we initialize as D = 
Yi (A^) , where f denotes the Moore-Penrose Pseudoinverse which, 
given the Singular Value Decomposition of a matrix X = US , 
is computed as X^ = VS^^'^U^. 

Algorithm 1: Alternating Least Squares Algorithm for CMTF 
Input: X of size I x J x K, matrices Yi, i = 1 ■ • ■ 3, of size 
I X I2, J X J2, and K x K2 respectively, number of factors 
F. 

Output: A of size / x F, b of size J x F, c of size K x F,D 
of size I2 X F,G of size J2 x F, E of size K2 x F. 

1: Unfold X into X(i),X(2),X(3) fsee ilU). 

2: Initialize A, B, C using PARAFAC of X. Initialize D, G, E 
as discussed on the text. 

3: while convergence criterion is not met do 
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D = Yi 
end while 
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Besides ALS, there exist other algorithms for CMTF. For ex- 
ample, (4) uses a first order optimization algorithm for the same 
objective. However, we chose to operate using ALS because it is 
the 'workhorse' algorithm for plain tensor decomposition, and it 
easily to incorporate additional constraints in ALS. Nevertheless, 
one strength of Scoup-Smt is that it can be used as-is with any 
underlying core CMTF implementation. 



3. PROPOSED METHOD 



3.1 Algorithm description 

There are three main concepts behind Scoup-Smt (outlined in 
Algorithm!!}: 

Phase 1 Sample the data in order to reduce the dimensionality 
Phase 2 fit CMTF to the reduced data (possibly on more than one 

samples) 
Phase 3 merge the partial results 

Phasel: Sampling An efficient way to reduce the size of the dataset, 
yet operate on a representative subset thereof is to use biased sam- 
pling. In particular, given a three-mode tensor X we sample as 
follows. We calculate three vectors as shown in equation (|2j, one 
for each mode of X. These vectors, which we henceforth refer to 
as density vectors are the marginal absolute sums with respect to 
all but one of the modes of the tensor, and in essence represent the 
importance of each index of the respective mode. We then sam- 
ple indices of each mode according to the respective density vector. 
For instance, assume an / x J x A' tensor; suppose that we need 
a sample of size - of the indices of the first mode. Then, we just 



as the probability of sampling the i-th index of the first mode, and 
we simply sample without replacement from the set {1 ■ ■ ■ J}, us- 
ing Pi as bias. The very same idea is used for matrices Yi. Doing 
so is preferable over sampling uniformly, since our bias makes it 
more probable that high density indices of the data will be retained 
on the sample, and hence, it will be more representative of the en- 
tire set. 

Suppose that we call X, ^7, /C the index samples for the three 
modes of X. Then, we may take Xj, = J ■, IC) (and similarly 
for matrices Yi); essentially, what we are left with is a small, yet 
representative, sample of our original dataset, where the high den- 
sity blocks are more likely to appear on the sample. It is important 
to note that the indices of the coupled modes are the same for the 
matrix and the tensor, e.g. I randomly selects the same set of in- 
dices for X and Yi . This way, we make sure that the coupling is 
preserved after sampling. 

Phase 2: Fit CMTF to reduced data Having said that, the key idea 
of our proposed algorithm is to run ALS CMTF (Algorithm [T] on 
the sample and then, based on the sampled indices, redistribute the 
result to the original index space. In more detail, suppose that As 
is the factor matrix obtained by the aforementioned procedure, and 
that jointly describes the first mode of X^ and Yi.s. The dimen- 
sions of As are going to be \X\ x F (where 1 1 denotes cardinality 
and F is the number of factors). Let us further assume matrix A of 
size I X F which expresses the first mode of the tensor and the ma- 
trix, before sampling; due to sampling, it holds that I ^ \I\. If we 
initially set all entries of A to zero and we further set A{X, :) — A3 
we obtain a highly sparse factor matrix whose non-zero values are 
a 'best effort' approximation of the true ones, i.e. the values of the 
factor matrix that we would obtain by decomposing the full data. 

So far, we have provided a description of the algorithm where 
only one repetition of sampling is used. However, if our sample 
consists of only a small portion of the data, inevitably, this will 
not be adequate in order to successfully model all variation in the 
data. To that end, we allow for multiple sampling repetitions in 
our algorithm, i.e. extracting multiple sample tensors X^ and side 
matrices Yi^s, fitting a CMTF model to all of them and combining 
the results in a way that the true latent patterns are retained. We 
are going to provide a detailed outline of how to carry the multi- 
repetition version of Scoup-Smt in the following. 

While doing multiple repetitions, we keep a common, subset of 
indices for all different samples. In particular, let p be the per- 
centage of common values across all repetitions and Ip denote the 
common indices along the first mode (same notation applies to the 
rest of the indices); then, all sample tensors X^ will definitely con- 
tain the indices Ip on the first mode, as well as (1 — p)§ indices 
sampled independently (across repetitions) at random. This com- 
mon index sample is key in order to ensure that our results are not 
rank deficient, and all partial results are merged correctly. 

We do not provide an exact method for choosing p, however, as 
a rule of thumb, we observed that, depending on how sparse and 
noisy the data is, a range of p between 0.2 and 0.5 works well. 
This introduces a trade-off between redundancy of indices that we 
sample, versus the accuracy of the decomposition; since we are not 
dealing solely with tensors, which are known to be well behaved in 
terms of decomposition uniqueness, it pays off to introduce some 
data redundancy (especially when Scoup-Smt runs in a parallel 
system) so that we avoid rank-deficiency in our data. 

Let r be the number of different sampling repetitions, resulting 
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in r different sets of sampled matrix-tensor couples Xi'' and Y^'^ 
(i — 1 • ■ ■ r, j = 1 • • ■ 3). For that set of coupled data, we fit a 
CMTF model, using Algorithm[T] obtaining a set of factor matrices 
a'*' (and likewise for the rest). 

Phase 3: Merging partial results After having obtained these r 
different sets of partial results, as a final step, we have to merge 
them together into a set of factor matrices that we would ideally 
get had we operated on the full dataset. 

In order to make the merging work, we first introduce the follow- 
ing scaling on each column of each factor matrix: Let's take A^*' 
for example; we normalize each column of A by the £2 norm of 
the common part, as described in line 8 of Algorithmic] By doing 
so, the common part of each factor matrix (for all repetitions) will 
be unit norm. This scaling is absorbed in a set of scaling vectors 
Xa (and accordingly for the rest of the factors). The new objective 
function is shown in Equation |5] 
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||Y2-Bdiag(As*Ai3)E^ 
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A problem that is introduced by caiTying out multiple sampling 
repetitions is that the correspondence of the output factors of each 
repetition is very likely to be distorted. In other words, say we have 
matrices A^^' and A'^'^' and we wish to merge their columns (i.e. 
the latent components) into a single matrix A, by stitching together 
columns that correspond to the same component. It might very well 
be the case that the order in which the latent components appear in 
A'^' is not the same as in A'^'. 

The sole purpose of the aforementioned normalization is to re- 
solve the correspondence problem. In Algorithm |3] we merge the 
partial results while establishing the correct correspondence of the 
columns. Theoretical intuition as to why this is possible follows as 
a proof sketch: 

Following the example of r = 2 of the previous paragraph, ac- 
cording to Algorithmic] we compute the inner product of the com- 
mon parts of each column of A'^' and A'^'. Since the common 
parts of each column are normalized to unit norm, then the inner 
product of the common part of the column of A*^^' with that of 
A*^' will be maximized (and exactly equal to 1) for the match- 
ing columns, and by the Cauchy-Schwartz inequality, for all other 
combinations, it will be less than 1. 



3.2 Speeding up the core of the algorithm 

In addition to our main contribution in terms of speeding up the 
decomposition, i.e. Algorithmic] we are able to further speed the 



algorithm up, by making a few careful interventions to the core 
algorithm (Algorithm [T}. 

Lemma 1 . We may do the following simplification to each pseu- 
doinversion step of the ALS algorithm (Algorithm\I}: 
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Proof. For the Moore-Penrose pseudoinverse of the Khatri- 
Rao product, it holds that [7J, [14] 

(A B)^ = (^A^A * B^b) ^ (A B)^ 

Furthermore \J \ 

(A B)^ (A B) = A^A * B^B 



Pi 
P2 



For a partitioned matrix P — 

may be written in the following form ||9] 
1 1 



it holds that its pseudoinverse 
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Pi, 



Putting things together, it follows: 
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[(A0B) 



which concludes the proof. □ 



The above lemma implies that substituting the naive pseudoin- 

version of ^j^^ with the simplified version, offers significant 

computational gains to Algorithm [T] and hence to Algorithm |2] 
More precisely, if the dimensions of A, B and M are I x R, J x R 
and I X I2, then computing the pseudoinverse naively would cost 
O (i?^ (/J + 12)), whereas our proposed method yields a cost of 
O (i?^ {I + J + I2)) because of the fact that we are pseudoinvert- 
ing only a small Rx R matrix. We have to note here that in almost 
all practical scenarios R <C I ,,J,l2- 

3.3 Accounting for missing values 

In many practical scenarios, we often have corrupted or missing 
data. For instance, when measuring brain activity, a few sensors 
might stop working, whereas the majority of the sensors produce 
useful signal. Despite these common data imperfections, it is im- 
portant for a data mining algorithm to be able to operate. 

We carefully ignore the missing values from the entire optimiza- 
tion procedure: Notice that is not the same as simply zeroing out 
all missing values, since might have a valid physical interpreta- 
tion. Specifically, we define a 'weight' tensor W which has '0' in 
all coefficients where values are missing, and 'I' everywhere else. 
Similarly, we introduce three weight matrices Wi for each of the 
coupled matrices Yi. Then, the optimization function of the CMTF 
model becomes 



Algorithm 2: Scoup-Smt: Fast, sparse, and parallel CMTF 
Input: Tensor X of size I x J x K, matrices Y;, i = 1 • ■ • 3, of 
size I X I2, J X J2, and K x K2 respectively, number of 
factors F, sampling factor s, number of repetitions r. 
Output: A of size / x F, b of size J x F, c of size K x F,D 
of size I2 X F,G of size J2 x F, E of size K2 x F. \a, 
As, Ac, A_D, A_B, Ag of size F x 1 which contains the scale 
of each component for each factor matrix. 
1: Initialize A, B, C, D, E, G to all-zeros. 
2: Randomly, using mode densities as bias, select a set of 
100p% (p € [0, 1]) indices Ip, Jp, ICp to be common across 
all repetitions. For example, Ip is sampled with probabilities 

with Pi (i) — XA(i)/^^ XA(i). Probabilities for the rest of 

i = l 

the modes are calculated similarly. 
3: for i = 1 ■ • r do 

{Phase 1: Sample indices} 

4: Compute densities as in equations |2l [3] El 

Compute set of indices X*-*' as random sample without 
replacement of {1 ■ ■ ■ /} of size // (s (1 — p)) with 

probability pi (j) = XA(i)/^^ Xa(i). Likewise for J ,K.„ 

Ii, I2, and I3. Set X^'-* = X U Ip. Likewise for the rest. 
5: GetXi"' = X(X''\ -^''^), Y^!,^ = Yi(j('), jj"') 



r{i) 



sample is used for coupled modes. 

{Phase: Fit the model on the sampled data) 

6: Run AlgorithmEfor X<'' and Y^} , j = 1 • ■ ■ 3 and obtain 

As, Bs, Cs, Ds, Gs, Es. 
7: A*'' (X'*' , :) = As. Likewise for the rest. 
8: Calculate the ^2 norm of the columns of the common part: 
A«(/) = II A('' (Xp, /)||2, for / = 1 • • . F. Normalize 
columns of A'*' using A^-* (likewise for the rest). Note that 
the common part of each factor will now be normalized to 
unit norm. 
9: end for 

{Phase 3: Merge partial results} 
10: A =Merge (Aj). Likewise for the rest. 
1 1 : Aa = average of A a 1 . Likewise for the rest. 
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As we show in Algorithm [T] we may solve CMTF by solving 
six least squares problems in an alternating fashion. A fortuitous 
implication of this fact is that in order to handle missing values for 
CMTF, it suffices to solve 



niin||W* (X- AB' 



III 



(6) 



where W is a weight matrix in the same sense as described a few 
lines earlier. 

On our way tackling the above problem, we first need to inves- 
tigate its scalar case, i.e. the case where we are interested only in 
B(j, /) for a fixed pair of j and /. The optimization problem may 



Algorithm 3: Merge: Given partial results of factor matrices, 
merge them correctly 

Input: Factor matrices A\ of size I x F each, and r is the 

number of repetitions, Xp: set of common indices. 
Output: Factor matrix A of size I x F. 
1: Set A = A'i) 

2: Set £ = {1 ■ ■ ■ F}, a list that keeps track of which columns 

have not been assigned yet. 
3: for i = 2 ■ ■ ■ r do 
4: for /i = 1 • • ■ F do 
5: for /2 in £ do 
6: Compute similarity 

v(/2) = (A(Xp,/2)r (aW(Xp,/0)) 
7: end for 

8: c* — arg maxc v(c) (Ideally, for the matching columns, 
the inner product should be close to 1 ; conversely, for the 
rest of the columns, it should be considerably smaller) 



A(:,c') = A«(:,/i)| 

entries of the column. 
Remove c* from list £. 
end for 
end for 



lA{;,c*) = 



, i.e. update the zero 



be rewritten as 
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|W(:, j) * X(:, j) - (W(:,i) * A(: /)) B(i,/)^ 



which is essentially a scalar least squares problem of the form: 



mm 

b 
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with solution in analytical form: b = -^^A- 

We may, thus, solve this problem of Equation |6] using element- 
wise coordinate descent, where we update each coefficient of B 
iteratively, until convergence. Therefore, with the aforementioned 
derivation, we are able to modify our original algorithm in order to 
take missing values into account. 

3.4 Parallelization 

Our proposed algorithm is, by its nature, parallelizable; in essence, 
we generate multiple samples of the coupled data, we fit a CMTF 
model to each sample and then we merge the results. By carefully 
observing Algorithmic] we can see that lines 3 to 9 may be carried 
out entirely in parallel, provided that we have a good enough ran- 
dom number generator that does not generate the very same sample 
across all r repetitions. In particular, the r repetitions are indepen- 
dent from one another, since computing the set of common indices 
(line 2), which is the common factor across all repetitions, is done 
before line 3. 

4. KNOWLEDGE DISCOVERY 

4.1 Scoup-SMT on Brain Image Data With Ad- 
ditional Semantic Information 

As part of a larger study of neural representations of word mean- 
ings in the human brain |16|, we applied Scoup-SMT to a combi- 
nation of datasets which we henceforth jointly refer to as BrainQ. 
This dataset consists of two parts. The first is a tensor that con- 
tains measurements of the fMRI brain activity of 9 human subjects, 
when shown each of 60 concrete nouns (5 in each of 12 categories, 
e.g. dog, hand, house, door, shirt, dresser, butterfly, knife, tele- 
phone, saw, lettuce, train). fMRI measures slow changes in blood 
oxygenation levels, reflecting localized changes in brain activity. 



Here our data is made up of 3 x 3 x 6mm voxels (3D pixels) cor- 
responding to fixed spatial locations across participants. Recorded 
fMRI values are the mean activity over 4 contiguous seconds, aver- 
aged over multiple presentations of each stimulus word (each word 
is presented 6 times as a stimulus). Further acquisition and prepro- 
cessing details are given in II16I . This dataset is publicly availablfl 
The second part of the data is a matrix containing answers to 2 1 8 
questions pertaining to the semantics of these 60 nouns. A sample 
of these questions is shown in Table[2] 

This dataset has been used before in works such as 1171 . 1181 . 

BrainQ's size is 60x77775x9 with over 11 million non-zeros 
(tensor), and 60 x 218 with about 11.000 non-zeros (matrix). The 
dimensions might not be extremely high, however, the data is very 
dense and it is therefore difficult to handle efficiently. For in- 
stance, decomposing the dataset using the simple ALS algorithm 
took more than 24 hours, whereas Scoup-Smt yielded a speedup 
of 50-lOOx over this (cf Figure[T). 

Simultaneous Clustering of Words, Questions and Regions of 
the Brain 

One of the strengths of our proposed method is its expressiveness 
in terms of simultaneously soft-clustering all involved entities of 
the problem. By taking a low rank decomposition of the BrainQ 
data (using r = 5 and s/ = 3, sj = 86, sk = 1 for the tensor and 
si for the questions dimension of the matrixfl we are able to find 
groups that jointly express words, questions and brain voxels (we 
can also derive groups of human subjects; however, it is an active 
research subject in neuroscience, whether brain-scans should differ 
significantly between people, and is out of the scope of the present 
work). 

In Figure |4] we display 4 such groups of brain regions that are 
activated given a stimulus of a group of similar words; these words 
can be seen in Table |2l along with groups of similar questions that 
were highly correlated with the words of each group. Moreover, 
we were able to successfully identify high activation of the premo- 
tor cortex in Group 3, which is associated with concepts such as 
holding or picking items up. 

Nouns Questions 



Group 1 


beetle 
pants 
bee 


can it cause you pain? 
do you see it daily? 
is it conscious? 


Group 2 


bear 
cow 
coat 


does it grow? 

is it alive? 

was it ever alive? 


Group 3 


glass 

tomato 

bell 


can you pick it up? 

can you hold it in one hand? 

is it smaller than a golfball?' 


Group 4 


bed 

house 

car 


does it use electricity? 
can you sit on it? 
does it cast a shadow? 



Table 2: Groups of nouns and questions that are both positively and 
negatively correlated, and correspond to the brain regions show in 
Fig.S 

By-product: Predicting Brain Activity from Questions 

In addition to soft-clustering, the low rank joint decomposition 
of the BrainQ data offers another significant result. This low di- 
mensional embedding of the data into a common semantic space, 
enables the prediction of, say, the brain activity of a subject, for a 



^http : //www ■ cs ■ emu .edu/afs/cs/project/theo-TS/ 
^We may use imbalanced sampling factors, especially when the 
data is far from being 'rectangular' . 



given word, given the corresponding vector of question answers for 
that word. In particular, by projecting the question answer vector to 
the latent semantic space and then expanding it to the brain voxel 
space, we obtain a fairly good prediction of the brain activity. 

To evaluate the accuracy of these predictions of brain activity, 
we follow a leave-two-out scheme, where we remove two words 
entirely from the brain tensor and the question matrix; we carry 
out the joint decomposition, in some very low dimension, for the 
remaining set of words and we obtain the usual set of matrices 
A, B, C, D. Due to the randomized nature of Scoup-Smt, we 
did 100 repetitions of the procedure described below. 

Let qi be the question vector for some word i, and Vi be the 
brain activity of one human subject, pertaining to the same word. 
By left-multiplying with D"^, we project q^ to the latent space 
of the decomposition; then, by left-multiplying the result with B, 
we project the result to the brain voxel space. Thus, our estimated 
(predicted) brain activity is obtained as Vi = BD^q^ 

Given the predicted brain activities vi and V2 for the two left 
out words, and the two actual brain images vi and V2 which were 
withheld from the training data, the leave-two-out scheme mea- 
sures prediction accuracy by the ability to choose which of the ob- 
served brain images corresponds to which of the two words. After 
mean-centering the vectors, this classification decision is made ac- 
cording to the following rule: 

||vi - V1II2 + ||V2 - V2II2 < ||vi - V2II2 + ||V2 - V1II2 

Although our approach is not designed to make predictions, pre- 
liminary results are very encouraging: Using only F=2 compo- 
nents, for the noun pair closet/watch we obtained mean accuracy of 
about 0.82 for 5 out of the 9 human subjects. Similarly, for the pair 
knife/beetle, we achieved accuracy of about 0.8 for a somewhat dif- 
ferent group of 5 subjects. For the rest of the human subjects, the 
accuracy is considerably lower, however, it may be the case that 
brain activity predictability varies between subjects, a fact that re- 
quires further investigation. 

We plan detailed experiments to determine the accuracy of these 
predictions compared to specialized methods that have previously 
been used for these predictions, but which do not have the ability 
of our method to discover latent representations, such as 1 1 8 1. 

4.2 Generality: Mining Social Networks with 
Additional Information 

We have demonstrated the expressive power of Scoup-Smt for 
the BrainQ dataset, but in this subsection, we stress the fact that 
the method is actually application independent and may be used in 
vastly different scenarios. To that end, we analyze a Facebook 
dataset, introduced in |22 |3. This dataset consists of a 63890 x 
63890 X 1847 (wall, poster, day) tensor with about 740.000 non- 
zeros, and a 63890 x 63890 who is friends with whom matrix, with 
about 1.6 million non-zeros. In contrast to BrainQ, this dataset is 
very sparse (as one would expect from a social network dataset). 
However, Scoup-Smt works in both cases, demonstrating that it 
can analyze data efficiently, regardless of their density. 

We decomposed the data into 25 rank one components, using 
Si = 1000, s,j — 1000, sk ~ 100 and s/ for both dimensions 
of the matrix, and manually inspected the results. A fair amount of 
components captured normal activity of Facebook users who occa- 
sionally post on their friends' walls; here we only show one out- 
standing anomaly, due to lack of space: In Fig. [5] we show what 
appears to be a spammer, i.e. a person who, only on a certain day, 
posts on many different friends' walls: the first subfigure corre- 
wwwsf^a4$?(5,qg^^g^^|S,.|(i|igg,;ond subfigure corresponds to the 



"^Download FACEBOOK at |http ://socialnet works, mpi-sws . org/dat 




(a) Group 1 (b) Group 2 (c) Group 3 (d) Group 4 



Figure 4: The latent brain images for the 4 word/question groups as shown in Table|2] We can see that for each different group, the activation 
pattern of certain brain regions is different. For instance. Group 3 refers to small items that can be held in one hand,such as a tomato or a 
glass, and the activation pattern is very different from the one of Group 1, which mostly refers to insects, such as bee or beetle. Additionally, 
Group 3, for instance, shows high activation in the premotor cortex which is associated with the concepts of that group. 



people who post on these walls, and the third subfigure is the time 
(measured in days); we thus have one person, posting on many peo- 
ples' walls, on a single day. 
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Figure 5: This is a pattern extracted using Scoup-Smt, which 
shows what appears to be a spammer on the Facebook dataset: 
One person, posting to many different walls on a single day. 



5. EXPERIMENTS 

We implemented Scoup-Smt in Matlab. For the parallelization 
of the algorithm, we used Matlab's Parallel Computing Toolbox. 
For tensor manipulation, we used the Tensor Toolbox for Matlab 
1 6 1 which is optimized especially for sparse tensors (but works very 
well for dense ones too). All experiments were carried out on a ma- 
chine with 2 dual-core AMD Opteron 880 processors (2.4 GHz), 4 
TB disk, and 48GB ram. The parallel experiments were run on all 
4 cores, which justifies our choice of r = 4 in this case. When- 
ever we conducted multiple iterations of an experiment (due to the 
randomized nature of Scoup-Smt), we report error-bars along the 
plots. For all the following experiments we used either portions of 
the BrainQ dataset, or the whole dataset. 

5.1 Accuracy 

In Figure|6]we demonstrate that the algorithm operates correctly, 
in the sense that it reduces the model cost (Equation[T} when doing 
more repetitions. In particular, the vertical axis displays the relative 
cost. 



Scoup-Smt cost 



ALS cost (with ideal being equal to 1) and the horizon- 
tal axis is the number of repetitions in the sampling. We observed 
that for a few executions of the algorithm, the cost was not mono- 
tonically decreasing; however, we ran the algorithm 1000 times. 



keeping the executions that decreased the relative cost monotoni- 
cally and plotted them in Fig. [6] 




2 3 4 

Number of repetitions 



Figure 6: The relative cost of the model, as a function of the 
number of repetitions r is decreasing, which empirically shows 
that Scoup-Smt actually reduces the approximation error of the 
CMTF model. 



5.2 Speedup 

As we have already discussed in the Introduction, Scoup-Smt 
achieves a speedup of 50-100 on the BrainQ; for the 50 x case, 
the same approximation error of the CMTF objective is maintained, 
while for higher speedup values, the relative cost increases, but 
within reasonable range. Figure (Tjillustrates this behaviour. 

Additionally, Scoup-Smt benefits greatly from it's inherent par- 
allelizability. The parallel results we report come from r — 4 rep- 
etitions of sampling, carried out on 4 cores; had more cores been 
available, we would probably observe a higher speedup (keeping 
of course Amdahl's law in mind), while maintaining low relative 
cost, since we establish in the previous subsection that the more 
repetitions we do, the better we approximate the CMTF model. 

5.3 Sparsity 

One of the main advantages of Scoup-Smt is that, by construc- 
tion, it produces sparse latent factors for coupled matrix-tensor 
model. In Fig. |7]we demonstrate the sparsity of Scoup-Smt's re- 
sults by introducing the relative sparsity metric; this intuitive metric 



is simply the ratio of the output size of the ALS algorithm, divided 
by the output size of Scoup-Smt. The output size is simply cal- 
culated by adding up the number of non-zero entries for all factor 
matrices output by the algorithm. We use a portion of the BrainQ 
dataset in order to execute this experiment. We can see that for the 
dense BrainQ dataset, we obtained twice as sparse results. How- 
ever, in experiments with randomly generated, sparse, data, we ex- 
perienced higher degrees of sparsity, in the order of 5 x . We omit 
such plots due to space constraints. 



25 
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Figure 7: The relative output size vs. the relative cost indicates 
that, even for very dense datasets such as BrainQ, we are able 
to get a 2 fold decrease in the output size, while maintaining good 
approximation cost. 

5.4 Robustness to missing values 

In order to measure resilience to missing values we define the 

lix ip 

Signal-to-Noise Ratio (SNR) as simply as SNR = ' , 

where X„ is the reconstructed tensor when a m fraction of the 
values are missing. In Figure |8] we demonstrate the results of 
that experiment; we observe that even for a fair amount of missing 
data, the algorithm performs reasonably well, achieving high SNR. 
Moreover, for small amounts of missing data, the speed of the al- 
gorithm is not degraded, while for larger values, it is considerably 
slower, probably due to Matlab's implementation issues. However, 
this is encouraging, in the sense that if the amount of missing data 
is not overwhelming, Scoup-Smt is able to deliver a very good 
approximation of the latent subspace. This experiment was, again, 
conducted on a portion of BrainQ. 

6. RELATED WORK 

Coupled, Multi-block, Multi-set Models Coupled Matrix-Tensor 
Factorizations belong to a family of models also referred to as 
Multi-block or Multi-set in the literature. Smilde et al. in II20I pro- 
vided the first disciplined treatment of such multi-block models, in 
a chemometric context. An important issue with these models is 
how to weigh the different data blocks such that scaling differences 
may be alleviated. In |23 |, Wilderjans et al. propose and compare 
two different weighing schemes. Most related to the present work 
is the work of Acar et al. in ||4|, where a first order optimization 
approach is proposed, in order to solve the CMTF problem. As we 
mention in the Introduction, Scoup-Smt is compatible with this 
algorithm, since it provides an alternative to the core CMTF solver. 
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Figure 8: This Figure shows the Signal-to-Noise ratio (SNR)-as 
defined in the main text- as a function of the percentage of missing 
values. We can observe that, even for a fair amount of missing 
values, the SNR is quite high, signifying that Scoup-Smt is able 
to handle such ill-conditioned settings, with reasonable fidelity. 

In 1 5 1, Acar et. al apply the CMTF model, using the aforemen- 
tioned first-order approach in a chemometrics setting. In [3 |, Acar 
et. al introduce a coupled matrix decomposition, where two matri- 
ces match on one of the two dimensions, and are decomposed in 
the same spirit as in CMTF, while imposing explicit sparsity con- 
straints (via l\ norm penalties); although ScotJP-SMT also pro- 
duces sparse factors, this so happens as a fortuitous byproduct of 
sampling, whereas in |3 | an explicit sparsity penalty is considered. 
As an interesting application, in |26|, the authors employ CMTF 
for Collaborative Filtering. On a related note, 0241 , 0131 , and II5I 
introduce models where multiple tensors are coupled with respect 
to one mode, and analyzed jointly; in this work, we don't consider 
coupling of two (or more) tensors, however, we leave that for future 
work. 

Having listed an outline of relevant approaches, to the best of 
our knowledge, Scoup-Smt is the first algorithm for CMTF that 
combines speed, parallelization, as well as sparse factors. An al- 
ternative perspective on ScotJP-SMT is that of a framework that is 
able to speed up and sparsify any (possibly highly fine tuned) core 
algorithm for CMTF. 

Treating Missing Values in Tensor Decompositions Tomasi et. al 
1 21 1 provides a very comprehensive study on how to handle missing 
values for plain tensor decompositions. 

Fast & Scalable Tensor Decompositions In fT9l we introduced a 
parallel algorithm for the regular PARAFAC decomposition, where 
a sampling scheme of similar nature as here is exploited; in QIOI , 
a scalable MapReduce implementation of PARAFAC is presented. 
Additionally, the mechanics behind the Tensor Toolbox for Matlab 
(6) are very powerful when it comes to memory-resident tensors. 
Finally, in |25 1, the authors introduce a parallel framework in order 
to handle tensor decompositions efficiently. 
Tensor applications to brain data There has been substantial re- 
lated work, which utilizes tensors for this purpose, e.g. [|8j. 12|. 

7. CONCLUSIONS 

Our main contributions are the following: 
• Fast, parallel & sparsity promoting algorithm: SCOUP-Smt 



is up to 50-100 times faster than state of the art algorithms. 

• Robustness to missing data: Scoup-Smt can effectively han- 
dle missing values, without significant performance degrada- 
tion, even for moderate amounts of missing entries. 

• Effectiveness and Knowledge Discovery: Scoup-Smt, ap- 
plied to the BrainQ dataset, discovers meaningful triple- 
mode clusters: clusters of words, of questions, and of brain 
regions have similar behavior; as a by-product, Scoup-Smt 
is able to predict brain activity with very promising prelimi- 
nary results. 

• Generality: We applied S COUP- Smt to a Facebook dataset 
with additional information, identifying what appears to be a 
spammer. 
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